The Role of Neural Networks in the Interpretation of Antique Handwritten Documents

Authors

  • Pilar Gómez-Gil
  • Guillermo de-los-Santos-Torres
  • Jorge Navarrete-García
  • Juan Manuel Ramírez-Cortés
Abstract

The need to access information through the web and other kinds of distributed media makes it necessary to convert almost every kind of document to a digital representation. However, many documents were created long ago and, in the best cases, only scanned images of them are currently available when a digital transcription of their content is needed. For this reason, libraries across the world are looking for automatic OCR systems able to transcribe such documents. In this chapter we describe how Artificial Neural Networks can be useful in the design of an Optical Character Recognizer able to transcribe old handwritten and printed documents. The properties of neural networks allow this OCR to adapt to the styles of handwriting or antique fonts. Advances with two prototype parts of such an OCR are presented.

The Problem of Antique Handwritten

Currently, web distribution of old documents is limited to scanned images, because most commercial Optical Character Recognizers (OCR) do not obtain good recognition rates with old handwritten documents or with documents printed in old font styles. The recognition of old handwritten and printed documents is a challenge in pattern recognition because of the special characteristics of the problem. Figure 1 shows an example of an old telegram written by Gral. Porfirio Díaz, president of Mexico at the beginning of the twentieth century. For a non-expert without previous knowledge of this kind of writing, it is very difficult to interpret the content of this document.

Digital processing of old documents faces, among others, the following conditions:
• Old documents have been damaged by the passage of time. In most cases they present stains, the color of the paper has changed, or its texture has deteriorated.
• The digitization process requires special care to protect the documents. Producing the digital image that will feed the OCR is, by itself, a delicate process. It requires a special kind of scanner that does not touch the document.
• The recognition of old documents is off-line. There is no information about the dynamics of the writing or the pressure used by the writer.

Fig. 1. An example of a telegram written by Porfirio Díaz at the beginning of the twentieth century [1]

In addition to these conditions, there are special complications in the recognition of old handwriting. Some of them are [2]:
• Old styles of handwriting have many ornaments.
• Fonts are not uniform. For example, the same character may look different in different places of a word, in different words, or in different documents. This happens in any kind of handwriting, and the effect is much stronger if the documents come from different writers.
• The shape and style of writing may differ even for the same person, depending on environmental factors, mood, type of pen, age, etc.
• Character segmentation requires extra procedures, besides the common ones such as identification of valleys and hills, because of the styles of different letters.
• In some patterns, different classes of characters are very similar in shape.

Figure 2 shows some examples of handwritten words written by the same writer at different times and in different documents. Notice that some letters have different shapes depending on their position in the word and on the word in which they appear. Some letters may be confused with a connecting stroke, and some letters may be "embedded", so that two of them look like one character.
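The "valleys and hills" mentioned in the list above usually refer to the vertical projection profile of a word image. The following is a minimal sketch of that common procedure, not the authors' segmentation method: the function names `column_profile` and `find_valleys`, the `threshold` parameter, and the toy image are all illustrative assumptions. It marks candidate cut points where the ink count per column drops to the threshold, which works for separated print but, as the list notes, is not sufficient for ornamented or connected handwriting.

```python
def column_profile(image):
    """Count ink pixels (value 1) in each column of a binary image."""
    return [sum(column) for column in zip(*image)]


def find_valleys(profile, threshold=0):
    """Return the middle column of every run of columns with ink <= threshold.

    Runs touching the left or right margin are reported too.
    """
    cuts = []
    start = None
    for x, count in enumerate(profile):
        if count <= threshold:
            if start is None:
                start = x  # a new valley begins here
        elif start is not None:
            cuts.append((start + x - 1) // 2)  # valley ended at column x - 1
            start = None
    if start is not None:
        cuts.append((start + len(profile) - 1) // 2)  # valley reaches the margin
    return cuts


# Hypothetical 4x9 binary image: two ink blobs separated by empty columns.
image = [
    [1, 1, 0, 0, 0, 1, 1, 1, 0],
    [1, 1, 0, 0, 0, 1, 0, 1, 0],
    [1, 1, 0, 0, 0, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 0, 0],
]
profile = column_profile(image)
cuts = find_valleys(profile)
print(profile, cuts)
```

In cursive writing the connecting strokes keep the profile above zero between letters, so a fixed threshold either misses cuts or splits single letters; this is one motivation for a segmentation stage that learns from the handwriting itself.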
Therefore, in terms of a pattern recognition problem:
• There are no evident prototypes to define each class.
• The variance among members of the same class is greater than expected.
• Common similarity metrics, such as Euclidean distance, are sometimes useless because the distance may be greater between patterns of the same class than between patterns of different classes.

An OCR for antique handwritten documents

The research group of Neural Networks and Pattern Recognition at Universidad de las Américas, Puebla, is currently working on the construction of an OCR able to recognize antique handwritten and printed documents. This OCR will be useful to our library, which possesses a huge number of such historical documents [3]. We propose the construction of an adaptive OCR, called Priscus (a Latin word meaning "antique"), that has the following components (see Figure 3):
− Digitization. Creation of a color or gray-level image of the document to be recognized.
− Pre-processing. Cleaning of the image, noise reduction, and conversion of the image to black and white.
− Segmentation of words. Given a binary map, this process obtains the words present in the image.
− Segmentation training. An adaptive system that learns to identify segmentation points in a word, based on the handwriting or font presented to the OCR.
− Character segmentation. This process obtains possible segments that may contain characters, based on the knowledge obtained from the segmentation training.
− Recognizer training. An adaptive system that learns to identify characters from segments obtained from the binary image of the document.
− Recognition of characters. This process receives segments of words, extracts features from them, and decides the most likely characters.
− Identification of words. Based on the possible words obtained by the recognizer and a dictionary, this process decides the most likely words.
− Correction of style. Based on the identified most likely words and grammar rules, this process creates well-formed sentences, obtaining a transcription of the document.

At this point, we have focused our research on the segmentation and character recognition components, using artificial neural networks.
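The text does not say how the word identification component combines the recognizer output with the dictionary. As a hedged illustration only, the sketch below assumes the recognizer emits a few candidate strings and picks the dictionary entry with the smallest Levenshtein (edit) distance to any of them. The functions `edit_distance` and `most_likely_word`, the word list, and the candidates are hypothetical and are not taken from Priscus.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def most_likely_word(candidates, dictionary):
    """Return the dictionary word closest to any recognizer candidate."""
    pairs = ((edit_distance(c, w), w) for c in candidates for w in dictionary)
    return min(pairs, key=lambda pair: pair[0])[1]


# Hypothetical dictionary and noisy recognizer output for one word image.
dictionary = ["telegrama", "presidente", "general"]
candidates = ["presidcnte", "prcsidonte"]
word = most_likely_word(candidates, dictionary)
print(word)
```

A real system would more likely weight substitutions by how often the recognizer confuses each pair of characters, rather than charging every edit the same cost.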



Publication date: 2007